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Abstract 

In this paper we present a generative model for protein contact networks. The soundness of the 
proposed model is investigated by focusing primarily on mesoscopic properties elaborated from the spectra 
of the graph Laplacian. To complement the analysis, we study also classical topological descriptors, such 
as statistics of the shortest paths and the important feature of modularity. Our experiments show 
that the proposed model results in a considerable improvement with respect to two suitably chosen 
generative mechanisms, mimicking with better approximation real protein contact networks in terms of 
diffusion properties elaborated from the Laplacian spectra. However, as well as the other considered 
models, it does not reproduce with sufficient accuracy the shortest paths structure. To compensate this 
drawback, we designed a second step involving a targeted edge reconfiguration process. The ensemble 
of reconfigured networks denotes improvements that are statistically significant. As a byproduct of our 
study, we demonstrate that modularity, a well-known property of proteins, does not entirely explain the 
actual network architecture characterizing protein contact networks. In fact, we conclude that modularity, 
intended as a quantification of an underlying community structure, should be considered as an emergent 
property of the structural organization of proteins. Interestingly, such a property is suitably optimized 
in protein contact networks together with the feature of path efficiency. 

Keywords — Protein contact network; Generative model; Graph Laplacian; Mesoscopic analysis. 


1 Introduction 

Protein contact networks (PCNs) are minimalistic models of protein 3D structures, which collapse the full- 
rank information of 3D coordinates of each atom into an adjacency matrix providing the pairwise contacts 
between residues [Ml EH IMsll Eg EH EQl ES] ■ A contact is scored if the Euclidean distance between alpha 
carbons of each residue pairs is within two van der Waals radii. PCNs allow for a reliable reconstruction 
of the global protein structure |S2]. Moreover PCNs allow for an efficient description of relevant biological 
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properties of proteins such as allosteric effect and identification of active sites [5D]. The efficiency of PCNs 
in retaining the essential features of protein structures makes the development of a PCN generative model of 
utmost importance for shedding light on both folding process and the structural bases of the unique functional 
properties of protein molecules H [Ml lini IMl- Although many generative models have been developed in 
the still young network science discipline [71117], fewer and less established examples are available in the 
literature when focusing on formal representations of protein molecules [IISIIHIIMIIISIISIIIIISHI- It is 
possible to cite approaches for modeling proteins based on knot theory [M] and topology [SSI EH Eli- Other 
approaches focus on approximating particular protein structures, such as the local structure [8] and specific 
fold families [2|- To the best of our knowledge, only Refs. 0123] studied the problem of generating PCNs 
for evaluating detailed graph-theoretical characteristics. 

The quest for a reliable and, most importantly, justifiable generative model for PCNs implies as a first 
step the identification of a target function. This allows for a unambiguous evaluation of the proposed model 
in terms of “superposition” of the simulated contact networks with the real PCNs. Inspired by the seminal 
works by Leitner [34] > we considered here as target function to approximate of the peculiar heat trace decay 
of PCNs [3H]. Such a property is elaborated from the heat kernel miEiiEa, the graph-theoretical analogue 
of the well-known first-order differential equation describing diffusion of heat in a physical medium. The 
heat trace (HT) is an invariant property - in the graph-theoretical sense - elaborated from the spectrum 
of normalized graph Laplacian [niiiiiiaiii]- Graph Laplacians are objective of focused studies in many 
scientific fields, as in fact it is possible to extrapolate many structural and dynamical properties from such a 
matrix representation of the system [IBESlElinT]- The interest in the study of graph Laplacians motivated 
also related researches focusing on the so-called spectral reconstruction, i.e., on calculating a graph given 
a specific spectrum to be considered as target [HdH]- Although potentially interesting, co-spectrality of 
graphs is a very hard problem and it is still a not very well-developed field in graph theory ED, limiting 
hence its practical exploitation. 

Inside a protein molecule, energy readily flows between distant regions connected by inter-module contacts 
(fast lane) and only slowly within modules (slow lane). In other words, proteins present either a strong 
modularity, causing a low thermal dissipation (heat is kept into modules by the richness of dead ends 
pathways slowing down the spreading of energy), and a suitable number of long-range shortcuts, allowing 
for a rapid and yet efficient communication between distant sites responsible for allosteric effect [34] ■ This 
dual behavior is at the basis of two crucial properties for protein physiology: 1) keeping a stable micro¬ 
environment for the catalyzed reaction (slow decay) and 2) allowing for an efficient spreading of information 
throughout the molecule to get rid of environmental changes (binding of an allosteric effector, pH changes, 
etc). There is a clear trade-off between the two above goals: increasing modularity is good for the first goal 
but is detrimental for the latter; on the other hand characteristic length minimization is an efficient strategy 
for the latter but is detrimental for the former. 

The contribution of this paper consists in a two-step generative model for PCNs; the first stage of our 
method takes inspiration from the work of Bartoli et al. I^. The dataset considered in our study consists 
of four ensembles (classes) of networks: i) actual PCNs elaborated from the E. coli proteome [MlE7]i ii) 
synthetic networks generated according to the recipe of Bartoli et al. [S], hi) synthetic modular networks 
generated with the method proposed by Sah et al. ET], and finally iv) those generated with our method. 
We evaluate the soundness of the proposed approach by focusing on mesoscopic analyses. Notably we 
first study characteristics elaborated from the normalized Laplacian spectra. To complement this spectral 
analysis, we also study topological descriptors, including statistics of the shortest paths and quantification 
of the network modularity. Results show that the ensemble of networks generated with our method ends up 
in a statistically significant improvement of similarity with real PCNs, as for both spectral and topological 
properties. However, a principal component analysis (PCA) of the considered topological descriptors revealed 
a gap with actual PCNs, specifically related to the shortest paths. The second step of the proposed method 
is hence designed to compensate this drawback. A new version of the ensemble of networks is then obtained 
by rewiring edges with high edge-betweenness |16j . As a result, we show that we are able to achieve a further 
statistically significant improvement of the ensemble characteristics, without altering the global spectral 
properties of the first ensemble that we generated. 
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As a byproduct of our study, we demonstrate that modularity, a well-known feature found in proteins 
as well as in many other biological networks, is not sufficient to explain the underlying network architecture 
of PCNs. This result is of particular interest, since it stresses the peculiar architecture of proteins that 
suitably merges conflicting features such as path efficiency and modularity. The topological properties of 
the obtained graphs, exploited by PCA, allowed us to offer a clear structural counterpart to HT features. 
These structural counterparts are by no means confined to protein science, given that well-known relations 
between small-world, fractal, and modular architectures with path efficiency are present also in brain 

networks [9] and are likely to constitute relevant features to be optimized in other natural and/or artificial 
networked systems. 

The remainder of this paper is structured as follows. In Sectionj^we describe the data that we considered 
in our study. In Section we show and discuss the obtained results. Section offers the conclusions and 
pointers to future research works. The proposed generative algorithm is fully described in the Methods 

the considered graph characterizations are instead introduced in Section 


section 

E3 


notably in Section A.l 


2 Dataset 

In our study we consider four ensembles of networks, three of which are created according to specific gener¬ 
ative models. Each ensemble contains 100 networks of varying size (from 300 to 1000 vertices). Each graph 
with n vertices and m edges is present in the four different ensembles preserving n and m, but varying the 
resulting topology according to a generative mechanism. This is performed to focus the analysis on ensemble 
features proper of the structural organization, without considering effects due to the size of the graphs. 

The first ensemble of graphs contains PCNs, directly obtained from the 3D native structures resolved for 
the E. coli proteome [36ll37j. Each vertex is defined as the alpha carbon of a residue; edges are added among 
two residues if their Euclidean distance is within the [4,8] A range. This choice is justified by observing the 
densities of native contact length in Fig. respectively when considering all contacts below 8 A and the 
herein adopted filtered version. It is possible to note a peak right before 4 A, which corresponds to trivial 
contacts due to closeness of the residues on the backbone EH; such contacts do not alter the typical network 
architecture of PCNs and therefore they are not considered in our PCN graph representation. 



012345678 
Euclidean distance among residues (A) 


Figure 1: Density of Euclidean distances among native contacts in PCNs. 

The second ensemble of networks is elaborated by using the recipe of Bartoli et al. [^, which considers a set 
of edges added deterministically due to the backbone and additional edges inserted according to a probability 
that scales linearly with the sequence distance of residues. For the third ensemble we use networks generated 
according to the recently-proposed scheme by Sah et al. m- Such a generation mechanism produces modular 
networks with controlled (i.e., used-defined) modularity and degree values. Such a generation mechanism 
is inspired by the fact that modular structures seem to be ubiquitous in biological networks; modularity is 
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considered to be at the basis of resilience and adaptability features of biological networks. It is well-known 
that PCNs are modular, i.e., they posses a well-defined community structure [SS]. Accordingly, the third 
ensemble of Sah et al. networks is generated by copying modularity and degree from the considered PCNs. 
The contemporary presence of networks in which modularity (if any) comes up as an emergent property 
and Sah et al. networks, in which modularity is a built-in property, will help us to check if the peculiar 
PCN spectral and topological properties are a direct consequence of their pronounced modularity or not. 
Finally, the fourth ensemble of networks is constituted by 100 graphs generated according to the herein 
proposed mechanism, referred to as LMGRS networks, whose first step consists in a variant of the Bartoli et 
al. model. In this adaptation, the linear scaling of the probability of the edges with the distance in sequence 
is replaced by the empirical frequencies measured in the PCN ensemble. As it is discussed later, the LMGRS 
are successively reconfigured in a iterative fashion to obtain a new ensemble of networks, shortened in the 
following as LMGRS-REC. The proposed LMGRS generation and LMGRS-REC reconfiguration methods 
are fully described in Sec. |A.1| 


3 Results 


The four ensembles are first evaluated in terms of mesoscopic properties elaborated from the spectra of the 
normalized graph Laplacians (see Sec. A.2 for technical details). Fig. |^shows, respectively, the characteristic 
HT decay and the ensemble spectral densities. The HT decay analysis ( |2(a)[ ) offers insights on an ensemble 
in terms of characteristic diffusion time. The analysis takes into account the varying-size character of the 
considered networks. From the plot it is possible to note that LMGRS introduces a considerable improvement 
with respect to Bartoli et al. and Sah et al. ensembles: the former decays around t ~ 350 while the 
latter decay much faster at t ~ 100. However, the PGN trend is not accurately approximated yet. This 
improvement of several order of magnitudes in the characteristic HT decay time can be explained by focusing 
on the spectral densities shown in Fig. |2(b)[ LMGRS ensemble density possesses clear similarities with the 
one of Bartoli et ah, being the two based on the same algorithmic template. Nonetheless, by highlighting the 
lower band of the spectra, it is possible to note some important differences for a specific region (in-between 
0 and 0.2) containing eigenvalues related to the modular structure of networks. LMGRS ensemble offers a 
better approximation for those small eigenvalues, which explains the significant improvement observed for 
the HT decay. 

Now we move to the analysis of the ensembles by considering the representation of each graph as a 
numeric vector containing suitable features that characterize different aspects of the network topology (see 
Sec. A.2). To offer a synthetic visualization, in Fig. we show the pairwise relations among the first 
three principal components (PG) elaborated via the principal components analysis (PGA) of such a vector- 
based representation of the considered networks; data is standardized prior to PGA processing. The first 
three PCs explain ~ 92% of the entire data variance (PCI ~ 39%, PC2 ~ 30%, and PC3 ~ 23%) and 
therefore they are considered as the signal part of information; the component loadings (Pearson correlation 
coefficients between original descriptors and components) are reported in Tab. Loadings on the PC offer a 
particularly interpretable scenario, where PCI is mostly explained by the path distribution (ACC and ASP) 
and the local clustering (ACL). As expected ACC negatively scales with both ASP and ACL, so pointing 
to the fact that ACL decreases the efficiency of signal transmission across the network (positive correlation 
with ASP). Thus, high values of PCI corresponds to architectures with elevated characteristic length (slow 
information transmission), while low values of PCI point to graphs with elevated closeness centrality (ACC) 
and thus relatively efficient signal transmission. PC2 is mainly correlated with MOD, A and H, with A 
going in the opposite direction with respect to the other two descriptors. This corresponds to the fact the 
regularity of a graph decreases as the modularity increases; it is also well-known that modularity affects 
random walks behavior, explaining the positive correlation with H. PC3 is entirely described by the spectra 
of the adjacency and Laplacian matrices (respectively indicated by EN and LEN). 

The PCs are linearly independent by construction, so the above results clearly indicate that the dataset 
can be described by three autonomous topological features: 1) path length and local clustering (PCI); 
mesoscopic modularity (PC2); and 3) spectral properties (PC3). The particular mixing of these independent 


4 










(a) HT slopes decay. 



Figure 2: Ensemble HT slopes decay (13) for the considered graphs [2(a)| and related Laplacian spectral 
densities |2(b)| The proposed LMGRS model clearly denotes more similar characteristics with respect to 
PCNs in terms of HT decay. Analogously, the LMGRS model induces a density of eigenvalues more similar 
to PGNs in the lower bands (see detailed plot), suggesting that the community structure is more suitably 
approximated. The Sah et al. model instead does not mimic the spectral density, even if the modularity 
level and degrees have been copied from the original PCNs. 


features varies across the different ensembles. Let us now focus on the PGA subspace spanned by PC1-PC2 
(Fig. 3(a)). It is possible to note that LMGRS ensemble introduces an improvement in PCI, which as 


explained before, encodes contributions in terms of path distribution and local clustering. A very interesting 
scenario can be observed when considering the projection given by PC1-PC3 (Fig. |3(b)[ ). In fact, when PC2 
is not considered Sah et al. and LMGRS become very similar to each other, and entirely different from the 
ensemble of Bartoli et al. To summarize, it is worth pointing out that the average Euclidean distance among 
the LMGRS and PCNs networks represented in the three-dimensional PGA space is significantly inferior 
(p < 0.0001) with respect to the distances among Bartoli et al. and PCNs (3.13 vs 4.69 with standard 
deviations 0.75 and 0.68, respectively). 


Table I: Principal component loadings. 



PCI 

PC2 

PC3 

MOD 

0.26148 

0.78872 

0.13409 

ACC 

-0.97835 

-0.15695 

-0.II588 

ASP 

0.92838 

0.26494 

0.09270 

ACL 

0.89098 

-0.22533 

-0.12227 

EN 

0.01862 

0.27707 

0.95810 

LEN 

0.04354 

-0.27981 

0.94213 

H 

0.05I7I 

0.99519 

-0.04393 

A 

0.09073 

-0.84835 

0.06463 


From PGA space snapshots we deduce that the LMGRS ensemble is a considerable improvement with 
respect to the others, while there is still a gap to be filled with respect to PCN. Notably, LMGRS networks 
present a too small characteristic path length - the small-world signature is too strong. This fact explains 
the differences observed for what concerns path distribution and modularity, since in fact the path efficiency 
and modular properties are two conflicting features in networks, which usually survive in the same network 
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Figure 3: PCA of the topological descriptors calculated on the four ensembles of protein graphs. LMGRS 
model is an improvement with respect to Bartoli et al. [S] also considering classical TD. 


in terms of a trade-off (modular organization is progressively lost as the network becomes more and more 
efficient in terms of shortest paths). To this end, as mentioned earlier, we consider another ensemble derived 
by post-processing LMGRS networks with the edge reconfiguration process described in Sec. denoted 
as LMGRS-REC. The reconfiguration process is targeted to rewire edges with high edge-betweenness, since 
those edges have a direct impact on path efficiency, and accordingly also on the modular organization. Each 
graph of the LMGRS ensemble is reconfigured according to the following process (see Sec. for details). 
At each iteration, the edge with maximum edge-betweenness is removed and it is re-attached to two randomly 
chosen vertices at a backbone distance given by the empirical distribution shown in Eig. This process 
is repeated until a suitable convergence criteria is met. In our case, we considered a number of iterations 
(50) that resulted in a statistically significant improvement of the ensemble features that we observe. In 
Fig. |4(a)| we show the detailed changes of the TD for the LMGRS and LMGRS-REG with respect to the 
PCNs. The figure reports, for each descriptor, the average absolute difference calculated for each graph of 
the respective ensembles with respect to the PGN graphs; standard deviations are reported as vertical bars. 
Results show that the reconfiguration algorithm performs well as for statistical significance of differences 
with PCN, assessed via t-test with the usual 5% threshold. In particular, as desired reconfigured networks 
denote more similar ASP {p < 0.0025) and ACC {p < 0.0001). As expected, such improvements for the 
shortest paths have a direct influence on the global modularity. In fact, MOD is significantly improved 
(p < 0.0018). This is a direct consequence of the fact that path efficiency and modularity are features to be 
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considered in a trade-off. ACL similarity improves as well {p < 0.0059), denoting a better approximation 
of the local cluster structure of PCNs. It is important to note that differences for EN {p < 0.0621) and 
LEN {p < 0.1134) are not statistically significant (especially those for LEN). This fact tells us that spectral 
properties of the reconfigured networks are not significantly altered. Fig. |5(b)| offers a visual confirmation 
of this fact. In fact, the spectral densities for LMGRS and LMGRS-REC reported in the figure are almost 
identical. However, it is worth discussing the HT slopes shown in Fig. |5(a)| From the figure, it is possible 
to notice a slight divergence among LMGRS and LMGRS-REC for large-time instants. This is due to 
the difference in magnitude of the first non-zero eigenvalue of the normalized Laplacian, which particularly 
influences the asymptotic HT behavior. Such a difference is a byproduct of the designed edge reconfiguration 
algorithm, which focuses on rewiring edges with high edge-betweenness: those are most likely connections 
among different densely connected communities. In graph-theoretical terms, LMGRS-REC networks are 
characterized by a lower conductance with respect to the LMGRS ensemble. In fact, the conductance is 
directly related to first non-zero eigenvalue of the Laplacian (details not shown here). Finally, differences for 
both H (p < 0.0030) and A (p < 0.0069) are significant as well. 
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(a) Average absolute differences. 


(b) ASP over iterations. 


Figure 4: Average differences for each TD with respect to PCN (4(a) I and their standard deviations. We are 
able to modify, among the other factors, the small-world character of the generated networks without signif¬ 
icantly affecting the spectra of the adjacency (EN) and Laplacian (LEN) matrices. Statistical significance 
of differences is assessed via t-test. Fig. |4(b)| shows the ASP of a sample graph during the reconfiguration 
process. 


To conclude this section, we discuss the results shown in Fig. The figure shows the PGA performed by 
considering the (standardized) HG coefficients ([^ of the first 100 time instants. HG coefficients include, in 
addition to the spectra, also the information of the eigenvectors of the normalized Laplacian, which encode 
the arrangement of the vertices in a given vector space. In the literature (see Ref. |44) and references therein) 
this is called localization of the eigenvectors and it is usually exploited for clustering purposes. While variance 
is almost entirely explained by PGl (~ 99.8%), for our purposes we consider the first three PGs. From PGl- 
PG2 it is possible to note that PGN, LMGRS, LMGRS-REG, and Sah et al. have a very well-defined and 
compact configuration in the PGA subspace. Since Sah et al. posses a well-defined community structure, we 
deduce that when considering the information of the eigenstructure of the normalized Laplacians, all four 
ensembles (i.e., PGN, LMGRS, LMGRS-REG, and Sah et al.) posses striking similarities. However, Bartoli 
et al. ensemble denotes a very disorganized configuration regardless of the considered PGA subspace (i.e., 
PG1-PG2 or PG2-PG3), which could be intended as a symptom of a weaker modular organization of the 
vertices. 
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Figure 5: Same as Fig but including also reconfigured LMGRS. 
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Figure 6: PCA of the first 100 time instants of the HC coefficients elaborated from the ensembles. 


4 Conclusions and future directions 

In this paper, we have proposed a two-step generative model for protein contact networks. For the first step, 
we partially took inspiration from the work of Bartoli et al. [5], whose idea was to generate contact graphs by 
first adding backbone contacts deterministically (considering adjacent residues along the sequence). Succes¬ 
sively, a number of additional contacts were added with a probability inversely proportional to the residue 
distance along the sequence. Here we modified this part by considering the actual empirical probability dis¬ 
tribution of contacts with respect to the sequence distance, derived from an ensemble of E. coli proteins. We 
analyzed our generative method by considering three additional ensembles composed of 100 varying-size pro¬ 
tein contact networks. We focused on a mesoscopic analysis, that is, we primarily investigated the soundness 
of the models by considering information derived from the eigendecomposition of the normalized Laplacian. 
Results showed that the proposed method approximates with better precision the behavior of actual protein 
contact networks in terms of characteristic diffusion time. We considered also several common topological 
descriptors. This last analysis pointed out that our method, as well as the others, does not approximate 
sufficiently well the path distribution. To this end, we designed an edge reconfiguration algorithm to be 
used as the second step of the proposed generative method. We then generated an additional ensemble of 




















reconfigured networks, which showed statistically significant improvements with respect to the initial one. 

We considered an ensemble composed of graphs synthesized according to a recently-proposed mechanism 
m, designed to construct a network with specified modularity and degree profiles. Notably, we reproduced 
the modularity and degree values from the actual protein contact networks herein considered. Results 
demonstrate that modularity, when hardcoded into the networks, does not explain the actual architecture 
of proteins. In fact, we concluded that modularity should be considered as an emergent property of such 
networks, which is suitably optimized in a trade-off with the conflicting feature of path efficiency. In our 
model, modularity emerged from the peculiar PCNs mesoscopic wiring obtained from their empirical contact 
distribution at increasing distance length: a simple linear decrease in contact frequency at increasing sequence 
distance does not allow to reach the typical modularity of real proteins. The fine tuning of long-range 
contacts allows for directly intervening on both modularity and path efficiency balance, so confirming the 
crucial importance of long-range contacts in folding process uniiia. 

A sound generative mechanism for protein contact networks is of utmost importance in current researches 
in protein science. In future works, in addition to the improvement of the herein proposed method, we also 
plan to tackle the problem in a data-driven generative learning scenario, for instance using generative (deep) 
neural networks |32| . The possibility to learn in a data-driven fashion an effective and sound model for 
protein contact networks would allow to easily generalize other instances of such networks. This perspective 
could be interesting also for protein engineering purposes m- The theoretical study of networks promises 
to pave the way for the discovery of universal principles at the basis of biological organization as well as 
instructing the generation of technological devices. 

A Methods 

A.l The proposed generative method for synthesizing protein contact networks 

In this section we describe the proposed generative method for PCN. Instead of merely presenting the 
mechanism itself, we first discuss an important fact related to the distance on sequence of amino acid 
residues and their relations on the native 3D structure of proteins. This initial discussion, in our opinion, is 
relevant for the purpose of designing and, most importantly justifying, a generative mechanism for PCN. 

Native contacts in folded proteins are in one way or another constrained by the covalent bonds due to the 
backbone. Therefore, a first interesting question that one would ask when designing a generative mechanism 
is “what is the effect of the backbone on the degree distribution of a PCN”. To provide an answer to such a 
question, we first define the notion of short range (SR) and long range (LR) contacts, that is, native contacts 
whose residues are, respectively, close and distant on the sequence (backbone). We chose 12 residues as 
threshold for SR contacts ED. Fig. 0 shows the two separate degree distributions elaborated from the 
considered ensemble of varying-size PCNs. SR contacts denote a clearly different distribution with respect 
to those that are LR; the latter is vaguely compatible with a power-law. 

Considering this fact and that PCNs do posses a modular architecture, one would be tempted to postulate 
a striking rule such as “SR contacts are intra-module while LR are inter-module links”. If this rule would be 
correct, it would be possible to design a generative mechanism accordingly, e.g., by connecting intra-module 
and inter-module links according to their specific (empirical) distributions; although the number of modules 
should be defined a priori. Nevertheless, such a possibility seems to be weakly supported by the following 
test. In Fig. [^we show a graphical representations of two PCNs, denoted as “JW0058” and “JW0I79”. 
Those two networks contain roughly the same number of amino acid residues (around a thousand); JW0058 
is made of two chains while JW0179 is derived from a single-chain polymer. To verify the above stated 
hypothesis, we need to consider a suitable criterion to generate a partition (i.e., to group the vertices into 
modules). In our case, we have chosen to use the partition having maximum modularity as computed with 
the algorithm presented in Ref. [5]. Results in Fig. [^demonstrate that intra-module links (solid lines) are 
SR (drawn in red), in both cases, only around 55% of the times. This fact - that has been verihed for a larger 
number of PCN - suggests to reconsider the possibility to follow such a SR/LR contacts characterization with 
respect to intra/inter module links. In addition, we found in our data that there is no trivial relation among 
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the distance on sequence and the Euclidean distance among residues in the 3D space (r ~ 0.162, details not 
shown here). These facts find confirmation by considering the enormous research effort in predicting native 
contacts starting from the sequence [slIIilEaEilEaliniliolllSIlielllg]. 




Degree Degree 

(a) SR contacts. (b) Distribution of LR contacts. 


Figure 7: Degree distribution of SR (7(a) I and LR (7(b) I contacts. Both distributions are provided in lin-log 
plots to improve visualization. SR contacts are determined by considering a distance on the sequence lower 
than or equal to 12 residues. 


Let us describe the proposed generative mechanism. Algorithm conveys the pseudo-code of the proce¬ 
dure. The algorithm takes inspiration from the mechanism introduced by Bartoli et al. [5]. Firstly, edges 
are deterministically added among any two residues at distance two on the sequence. This provides the 
definition for backbone contacts. The main difference with Bartoli et al. [S] is that, to add all remaining 
non-backbone contacts, we use the empirical distribution shown in Fig. instead of a linear function of 
the sequence distance. As shown in the results, this straightforward modification resulted in a considerable 
improvement under many aspects. 


Algorithm 1 Pseudo-code of the proposed generative algorithm. 

Require: Number of vertices, n, and edges, m 
Ensure: A graph G = (V,£^) with n = |V| and m = \S\ 

1: Add n vertices in V with unique, progressive, numerical identifiers 
2: Add backbone contacts in 8: connect all vertices Vi and Vj for which \i — j\ =2 
3: Loop to add all remaining non-backbone contacts rv 
4: while \€\ < m do 

5: Select two non-connected vertices Vi and Vj with probability p{\i — j\) given by their distance \i — j\ according to the 

empirical distribution in Fig. 

6: Add the undirected edge e = (vi,Vj) in S 

7: end while 
8: return G= (V,5) 


We now introduce the second step of the proposed generative mechanism, which implements the edge 
reconfiguration. Given a graph G = (V,£’), the herein introduced reconfiguration step is primarily meant 
to lower the small-world signature in G. This is done by iteratively rewiring edges in E according to their 
edge-betweenness value. The pseudo-code of the edge reconfiguration algorithm in delivered by Algorithm 
The reconfigured graph G is obtained at the end of the reconfiguration loop. Please note that we insure 
connectedness for all G. The loss of small-world signature in G is primarily verified with the ASP increase 
(see Fig. |4(b)| for an example), which is a consequence of the targeted rewiring of edges with maximum 
edge-betweenness. 


10 


























































Algorithm 2 Pseudo-code of the proposed edge reconfiguration algorithm. 

Require: A graph G = (V,^) with n = |V| and m = \£\ 

Ensure: A modified graph G = (V, S) with n = |V| and m = \S\ 

1: loop 

2: Calculate the edge-betweenness measure for all edges in S 

3: Let Cmax be the edge with maximum edge-betweenness. Remove Cmax from S 

4: Select two non-connected vertices Vi and Vj with probability p{\i — j\) given by their distance \i — j\ according to the 

empirical distribution in Fig. 

5: Add the undirected edge e = (vi,Vj) in S 

6: end loop when stop criterion is met 
7: return G = (V,S) 
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(b) JW0179 modules/links organization. 


Figure 8: Classification of contacts by considering the SR/LR typology and the intra/inter module arrange¬ 
ment. The partition is derived by using the maximum modularity criterion. Vertex assignment to modules is 
represented using different colors; numerical module identifies are drawn in the legend and in the correspond¬ 
ing vertex labels. Solid lines denote intra-module links while dashed lines inter-module links. Black links 
denote LR contacts, while red links are SR. Please note that the length of the links in the figures respects 
the actual Euclidean distances of contacts. The assumption that LR contacts are mostly inter-module links 
(and accordingly, SR contacts are mostly intra-module links) seems to be disproved by those examples. 
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Distance on sequence 

Figure 9: Empirical distribution of contacts considering the distance on sequence (backbone distance). The 
distribution is elaborated from the entire ensemble of PCN without considering contacts at distance one and 
two on the sequence; backbone contacts are added deterministically. 
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A.2 Graph characterization and heat kernel 

In this appendix we provide essential technical details regarding the graph characterizations used in this 
paper to study the four ensembles of variable-size graphs representing proteins. 

The first characterization employs classical topological descriptors, which include statistics of the de¬ 
grees/shortest paths and also some elaborations of the graph spectra. We consider the modularity (MOD) 
[SllH] for quantifying the presence of a global community structure - please note that we consider the value 
associated to the partition with maximum modularity. Then we consider the average closeness centrality 
(ACC), average shortest path (ASP), and the average clustering coefficient (ACL) [T3j; the energy (EN) and 
Laplacian energy (LEN) of the corresponding spectra [511; the ambiguity (A) |3S], which expresses the degree 
of irregularity of the topology; and finally the 2-order Renyi entropy of a stationary Markovian random walk 
(H) [Hj. 

In what follows, we describe the heat kernel and the derived invariants used to characterize the considered 
ensembles. The following material is principally reorganized from Ref. |38j . Let G = (V, £") be a graph with 
n = |V| vertices and m = \£\ edges. Let be the adjacency matrix defined as = I if there is an edge 

between vertices Vi, Vj € V; Aij = 0 otherwise. Let us define the degree of a vertex Vi as deg(ui) = Aj- 
In addition, let us define D as a diagonal matrix of degree: Da = deg(wi). Let L be the Laplacian matrix 
given by L = D — A; the normalized Laplacian matrix as L = As a consequence, L is 

symmetric and positive semi-definite and therefore it has non-negative eigenvalues. The eigendecomposition 
of the Laplacian is expressed as L = <i)A<I>^, where A is the diagonal matrix containing the eigenvalues 
arranged as 0 = Ai < A 2 < ... < A„ < 2; $ contains the corresponding (unitary) eigenvectors as columns. 

The heat equation [IS] associated to L is given by 


dUt 

dt 


= -LH, 


( 1 ) 


where Ht is a doubly-stochastic nx n matrix, called heat matrix, and t is the time variable. Eq. [^describes 
the diffusion of heat/information across the graph over time. Being doubly-stochastic, the heat matrix 
possesses a uniform stationary distribution. It is well-known that the solution to Q is 


Ht = exp(-Lt), 


( 2 ) 


which can be solved by exponentiating the spectrum of L: 

n 

Ht = $exp(-At)$^ = '^exp{-Xit)(j)i(j)f. (3) 

The heat trace (HT) of is an invariant feature that is given by 

n 

HT(t) = Tr(Ht) = ^ exp(-AA), (4) 


which thus takes into account only the eigenvalues of L. The heat content (HC) of Ht is defined by considering 
also the eigenvectors of L: 

n 

HC(t) = ex-p{-\it)(j)i{v)4>i{u), (5) 

ueVuev u^v u^v i—i 

where with (/i(i') we indicate the value related to the vertex v in the zth eigenvector. 

Eq. [^can be described in terms of power series expansion, 

00 

HC(f) = ^ qmf^. (6) 

m=0 
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By using the McLaurin series for the exponential function, we have 


( \ Xmxr] 

exp(-A.i) = 5: 

m' 


m—0 


which substituted in Eq. gives: 


^ oo n /_^ \mj.r, 

HC(t) = EEE exp( Xit)(j)^{v)4>i{u) = 

uGVuGV i—1 m—0 uGV uGV i—1 


(7) 

( 8 ) 


The Qm coefficients in are graph invariants (called heat content invariants, HCI) that can be calculated 
in closed-form by using Eqs. |^and[^ 

( 9 ) 

i=l \«GV / 

In what follows, we provide an argument to characterize the heat trace @ as a property of a homogeneous 
ensemble of graphs. The heat trace Q of a graph G = {V,£), with n = |V|, can be expressed as 

n n 

HTG(t) = ^ exp(-Ait) = 1 + E exp(-Aif). (10) 

i=l i=2 


where Xi are the eigenvalues of the normalized Laplacian of G. Let us define an ensemble C of graphs, in which 
all graphs share a common characteristic spectral density. Such spectra can be synthetically described by 
considering the spectral density of C. Accordingly, we can consider the eigenvalues as i.i.d. random variables, 
Xi, assuming values according to the spectral density of the ensemble; note that Ai assumes deterministically 
the value 0. The HT of a generic graph G € C,n = |V|, can be written as: 


HTG(t; n) = 1 + ^ exp(-At) = 1 + E 6xp(-At). 


( 11 ) 


The last step in Eq. 
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is carried out by considering that, since the Xi are assumed as i.i.d., their values 
can be expressed as n realizations of a single random variable, A, characterized by the same probability 
density function. For a fixed value of time t, we can define the ensemble HT, HTc(n;f), as the mean HT 
over all graphs of the ensemble C with varying size n. This quantity is given by: 


HTc(n;t) = (HTG(t;n))c = 1 + ^(exp(-At))c = 1 + {n - l)(exp(-At))c. (12) 

1=2 


Hence, HTc(n;t) can be expressed as a linear function of the graph size 


HTc(n;t) = 1 - ac{t) + ac{t) ■ n ac{t) ■ n, (13) 

where ac{t) = (exp(—At))c € [0,1] is a time-dependent slope that is characteristic for the entire ensemble 

C. 
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